Introduction to Data Management in Exact Sciences

Doctoral College of Brittany

Damien Belvèze

University of Rennes

2025-07-04

  • Advice on data management
  • Training (data, reproducibility, identifiers)
  • Support for data management plans
  • Curation of the Univ-Rennes collection on Recherche Data Gouv

ARDoISE data hub

1. Research Data, What Are We Talking About?

Figure 1: raw data, refined data

Which Files Are Important to Make Available?

  • raw_data_fish_counter.csv
  • intermediate_data.xls
  • filter1.py
  • first_draft_submission.pdf
  • fish_counter_calibration.md
  • kick_off_report.docx
  • filter2.py
  • notebook_experiment.ipynb
  • final_data_fish_counter.xls
  • project_presentation_funders.pptx
  • final_data.csv
  • study_draft.qmd
  • january_meeting_partners.docx
  • fish_counter_instructions_for_use.pdf
  • gantt_calendar.xlsx

Answers

  • raw_data_fish_counter.csv
  • intermediate_data.xls
  • filter1.py
  • first_draft_submission.pdf
  • fish_counter_calibration.md
  • kick_off_report.docx
  • filter2.py
  • notebook_experiment.ipynb
  • final_data_fish_counter.xls
  • project_presentation_funders.pptx
  • final_data.csv
  • study_draft.qmd
  • january_meeting_partners.docx
  • fish_counter_instructions_for_use.pdf
  • gantt_calendar.xlsx

2. Towards Cumulative, Reliable, and Reproducible Science?

Figure 2: static data to linked data

Permanence of Data Access

struggling with data loss 📓 Gibney & Van Noorden (2013)

3. A Challenge of Open Science

FAIR Principles

Figure 3: FAIR principles

Openness / Closure

  • “as open as possible, as closed as necessary”

  • Openness by default

  • Closure must be justified, for example for:

    • personal data
    • intellectual property

Making Your Data Findable

Qualities of a good repository:

  • reputation
  • sustainability (institutional support)
  • open license
  • persistent identifier
  • richness of metadata
  • curation

Discipline                                Repository
Images (humanities and social sciences)   MediHal
Code                                      Software Heritage via HAL
Bioinformatics                            GenOuest
Humanities                                Nakala
Mathematics                               no dedicated repository; contact the RNBM group
Environment, hydrology                    Data InDoRES
Earth sciences                            Data Terra
Marine sciences                           Data Ifremer, SEANOE
Medical sciences                          INSERM repository on RDG
Ecology, environment, and society         Data InDoRES and Cat.InDoRES

Recherche Data Gouv

  • richness of metadata
  • curation
  • national reference (supported by the Ministry)
  • persistent identifier
  • significant volume
  • free of charge
  • simplified generation of datapapers
  • RDG sandbox

Are Data Accessible?

In 93% of cases, requests for the data got no response or a refusal without justification 📓 Gabelica et al. (2022)

Figure 4: “data available upon request”

Are Data Interoperable?

Which identifiers to use for copper telluride?
Registry       Identifier
CAS number     12019-52-2
PubChem CID    6914517
PubChem SID    24879035
OpenSMILES     CuCu.CuCu.TeTe
InChI          InChI=1/2Cu.Te
MDL number     MFCD00049727
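
One way to keep such identifiers usable by machines is to store them together in a small record next to the dataset, and to check the ones that carry a built-in checksum. The sketch below is only an illustration: the dictionary `compound_identifiers`, its key names, and the helper `cas_checksum_ok` are hypothetical, and the values are copied from the table above. It relies on the CAS check-digit rule (each digit before the check digit is weighted by its position counted from the right, and the sum modulo 10 must equal the check digit).

```python
import json

# Identifiers for copper telluride, copied from the table above.
# The key names are illustrative, not a standard metadata schema.
compound_identifiers = {
    "name": "copper telluride",
    "cas": "12019-52-2",
    "pubchem_cid": "6914517",
    "pubchem_sid": "24879035",
    "inchi": "InChI=1/2Cu.Te",
    "mdl": "MFCD00049727",
}


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the last digit of a CAS number matches its checksum."""
    digits = cas.replace("-", "")
    if len(digits) < 2 or not digits.isdigit():
        return False
    body, check = digits[:-1], int(digits[-1])
    # Weight 1 for the rightmost body digit, 2 for the next one, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


# Serialize the record so it can be stored alongside the data files.
record_json = json.dumps(compound_identifiers, indent=2)
cas_is_valid = cas_checksum_ok(compound_identifiers["cas"])
print(record_json)
print("CAS checksum valid:", cas_is_valid)
```

A check like this catches typing errors in an identifier before the record is deposited, which is one reason to prefer identifiers with a documented structure.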

Transparent Formats?

Figure 5: CSV vs XLS

📓 Ziemann et al. (2023)
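
To illustrate why a plain-text format is called transparent, here is a minimal sketch that writes a few rows to CSV and reads them back with the Python standard library only. The rows, column names, and values are made up for the example. The point is that every value comes back exactly as the string that was written, with no hidden type conversion.

```python
import csv
import io

# Hypothetical sample rows, invented for this example.
rows = [
    {"day": "1", "month": "March", "count": "8"},
    {"day": "2", "month": "March", "count": "2"},
]

# Write the rows as CSV into an in-memory text buffer (it stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["day", "month", "count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back: each value is returned as the string that was written.
read_back = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
```

Because the file is plain text, it can also be opened in any editor and compared line by line in a version control system.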

Documenting the Data

  • Documentation is the glue that binds a data science project together 📓 Ziemann et al. (2023)
  • Carefully describe the data and the context of its acquisition (production, collection)
  • Use literate programming: keep the code and its explanation in the same document
  • Describe the data using ontologies
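
Describing the data can start with a small data dictionary kept next to the dataset. The sketch below is an illustration under stated assumptions: the column names, units, and descriptions are hypothetical, and `render_data_dictionary` is a helper written for this example, not part of any library.

```python
# Hypothetical data dictionary for a fish counter file.
data_dictionary = [
    {"column": "day", "type": "integer", "unit": "day of month",
     "description": "Calendar day of the observation"},
    {"column": "month", "type": "text", "unit": "",
     "description": "Month of the observation"},
    {"column": "count", "type": "integer", "unit": "fish",
     "description": "Number of fish counted by the sensors"},
]


def render_data_dictionary(entries):
    """Render the entries as an aligned plain-text table for a README file."""
    headers = ["column", "type", "unit", "description"]
    # Each column is as wide as its longest cell (header included).
    widths = [max([len(h)] + [len(e[h]) for e in entries]) for h in headers]
    lines = ["  ".join(h.ljust(w) for h, w in zip(headers, widths))]
    for entry in entries:
        lines.append("  ".join(entry[h].ljust(w) for h, w in zip(headers, widths)))
    return "\n".join(line.rstrip() for line in lines)


readme_table = render_data_dictionary(data_dictionary)
print(readme_table)
```

Keeping the dictionary as structured data, rather than free text, means the same source can feed a README, a repository metadata form, or a validation script.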

Documenting to Avoid Context Errors

Be precise in describing the context of data production

Figure 6: the importance of data context

Ontologies

Discipline         Thesaurus
Biodiversity       INRAE
Environment        GEMET
Biology, health    MeSH
Mental health      Ascodocpsy

directory of thesauri

Reusable Data?

  • Creative Commons (CC BY)
  • a license written by a law firm specializing in intellectual property, providing for a variety of authorized or prohibited use cases
  • ODbL
  • Etalab
  • no license, do whatever you want with my dataset
  • CC0
  • CC BY for everyone except fossil fuel industries, arms sellers, and Google (📓 Thomas (2023))

Data Management Plan

  • The DMP summarizes all the choices made for data management
  • Submit an initial version of a DMP 6 months after signing a contract (ANR, European projects)
  • DMP OPIDOR

4. Let’s Get Practical

Figure 7: Silurus glanis

Fictional data (generated with ChatGPT)

Day January February March April May June July August September October November December
  1       5        3     8     4   2   10   11     12         9       7        6        1
  2       3        7     2     5   4    8    1      6        11      10        9       12
  3       6        4     9     7  11    3    8     5#         2       1       10        6
  4       8        1     0     3   0    9   4#      7         5       2        8        4
  5       2       10     7    12   8    4   11      1         6       3        0        5
  6       4        0     3     1   5    7    2     10         0      12        4        9
  7       7        5     1     9  10    2    6      3         4      11        1        7
  8      11       0*     6     2   3    1    7      9        12      **        5        2
  9       1       9*     4    11  7*    5   10     2#         3      **       11       10
 10       9       2*    10     6  1*   11   3#      8         7      **        3        1
 11       7       6*     5     1  9*   2#   4#     11         8      **        6        4
 12       4      11*     1     8  5*    6    9      7        10      **        2        8
 13      12       7*     2     4 11*    3    5      6         9      **        4        3
 14       8       5*     3     7  6*   10   1#     2#         4      **        1        5
 15       2       0*    10    12   4    8    6      9         0      **        0        2
 16       6        1     8     2   7    4   11      3        10      **        8       11
 17       3        4     7     5   1    9   2#     10         6      **        3        7
 18       5        9     6    11   3    1   10      8         2      **        7        1
 19       7        6     0     4  10   12   5#      1         8       3        2        6
 20      11        2     9     3   8    5    6      4         1       7        9       10
 21       1        7     3     5   6    4    8     11        10       2        4        9
 22       0        3     8     4   2   10   11     12         0       7        6        1
 23       3        7     2     5   4    8    1      6        11      10        9       12
 24       6        4     9     0  11    3    8      5         2       1       10        6
 25       8        1    11     3   6    9    4      7         5       2        8        4
 26       2       10     7    12   0    4   11      1         6       3        7        5
 27       4        6     0     1   5    7    2     10         8      12        4        9
 28       7        5     1     9  10    2    6      3         4      11        1        7
 29      11      N/A     6     2   3    1    7      9        12       4        5        2
 30       1      N/A     4    11   7    5   10      2         3       8       11       10
 31       9      N/A    10   N/A   1  N/A    3      8       N/A       5      N/A        1

Dear Prof. Armand,

I'm enclosing the data collected this year by our various sensors installed on the Tydale as part of your study "Growth of the wels catfish *Silurus glanis* in European rivers: the case of the Tydale river".
Funds from the Royal Fisheries Corporation (RFC) paid for 8 underwater sensors along the Tydale which, when properly configured, count only fish weighing over 10 kg. Thanks to the camera's artificial intelligence, catfish were counted with a margin of error of around 3%.
We noted certain incidents in the data that could affect the proper conduct of the study.
In February and May, we noted that some sensors were no longer working properly and had to be repaired. In October, the centralized results collection system broke down for 11 days, causing us to lose data.
In addition, we noticed that boating activities on the Tydale on certain days in June and July may have disturbed the catfish, which were therefore less likely to be present on those days.

I hope that these figures will nevertheless allow you to move your study forward and submit your publication before spring.

Best regards,

Mickael J. Bernache, Biodiversity Research Institute of Portland
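
The symbols mixed into the counts (`*`, `#`, `**`, `N/A`) prevent the table from being read as numbers. A minimal sketch of one way to tidy it: split every raw cell into a numeric value and a separate flag. The flag meanings below are assumptions inferred from the letter, since the data provider never defines them, and the names `FLAG_MEANINGS` and `parse_cell` are hypothetical.

```python
from typing import Optional, Tuple

# Assumed meanings, inferred from the letter; they are NOT defined in the data.
FLAG_MEANINGS = {
    "*": "sensor malfunction (assumed)",
    "#": "boating activity on the river (assumed)",
    "**": "collection system outage, value lost (assumed)",
    "N/A": "day does not exist in this month (assumed)",
}


def parse_cell(cell: str) -> Tuple[Optional[int], str]:
    """Split a raw cell such as '5#' into (count, flag).

    The count is None when no number was recorded; the flag is an empty
    string when the cell holds a plain number.
    """
    cell = cell.strip()
    if cell in ("**", "N/A"):
        return None, cell
    flag = ""
    if cell and cell[-1] in "*#":
        flag = cell[-1]
        cell = cell[:-1]
    return int(cell), flag


# A few raw cells as they appear in the table.
raw_cells = ["11", "0*", "5#", "**", "N/A"]
parsed = [parse_cell(c) for c in raw_cells]
for raw, (count, flag) in zip(raw_cells, parsed):
    print(raw, "->", count, FLAG_MEANINGS.get(flag, "no flag"))
```

Storing the flag in its own column keeps the counts numeric and makes the incident explicit, provided the meaning of each symbol is confirmed with the data provider and written into the documentation.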

Figures

Figure              Credits
Figure 1            Maricx
Figure 2            Tim Berners-Lee
Data loss figure    Gibney, Van Noorden
Figure 3            Wilkinson, Dumontier et al.
Figure 4            Sergio Uribe
Figure 5            meme whose origin is lost in the mists of time
Figure 6            Ralph Aboujaoude Diaz
Figure 7            Dieter Florian

Software Used for the Presentation

Apart from its translation from French to English, which was done with the help of ChatGPT, this presentation was created with free software (thank you, Richard M. Stallman), including:

Quarto 1.3.450
VS Code 1.8.0
R:
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21 ucrt)
 os       Windows 11 x64 (build 22631)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  French_France.utf8
 ctype    French_France.utf8
 tz       Europe/Paris
 date     2024-04-16
 pandoc   3.1.11 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.2)
 digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.2)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.2)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.2)
 knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.2)
 rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.0)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.2)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
 xfun          0.42    2024-02-08 [1] CRAN (R 4.3.3)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.2)

 [1] C:/Program Files/R/R-4.3.0/library

──────────────────────────────────────────────────────────────────────────────

The text editor is VS Code, which makes Quarto easier to use (Quarto has no graphical interface, and neither does Pandoc, which is bundled with Quarto). VS Code can also run embedded R chunks, provided that R and the required packages are installed on the machine.

The sessioninfo package in R shows how the environment behind data manipulation code can be documented automatically, for example when producing a figure from a dataset. Code reproducibility is an essential issue, closely tied to data access.
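
The same idea carries over to other languages. As a rough Python counterpart, the sketch below records the interpreter version, the operating system, and the versions of selected packages using only the standard library. The function name `session_info` and the package names passed to it are illustrative choices, not an established tool.

```python
import importlib.metadata
import platform


def session_info(packages):
    """Collect the Python version, the OS, and the versions of given packages."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the report is complete.
            info["packages"][name] = "not installed"
    return info


# The second name is deliberately fictitious to show the fallback.
info = session_info(["pip", "a-package-that-does-not-exist"])
for key, value in info.items():
    print(key, ":", value)
```

Saving this output next to a figure or a processed dataset documents which software versions produced it.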

References

Gabelica, M., Bojčić, R., & Puljak, L. (2022). Many researchers were not compliant with their published data sharing statement: Mixed-methods study. Journal of Clinical Epidemiology, 150, 33-41. https://doi.org/10.1016/j.jclinepi.2022.05.019
Gibney, E., & Van Noorden, R. (2013). Scientists losing data at a rapid rate. Nature. https://doi.org/10.1038/nature.2013.14416
Thomas, M., & Tannier, É. (2023, May 17). Se réapproprier la production de connaissance. AOC media - Analyse Opinion Critique. https://aoc.media/opinion/2023/05/17/se-reapproprier-la-production-de-connaissance/
Ziemann, M., Poulain, P., & Bora, A. (2023). The five pillars of computational reproducibility: Bioinformatics and beyond. Briefings in Bioinformatics, 24(6), bbad375. https://doi.org/10.1093/bib/bbad375